A Contextual Post-processing Model for Korean OCR using Synthesized Statistical Information

Authors

Young-Sook Hwang

Bong-Rae Park

Hae-Chang Rim

Seong-Whan Lee

Abstract

In this paper, we describe a contextual post-processing model for Korean OCR that takes unknown words into account. The work starts from the following premises: 1) in a language with a very large character set, it is hard to correct an erroneous string directly; 2) word formation is closely related not only to morphological features but also to phonological features (especially syllable combination at the surface level); 3) an OCR post-processing system must rely on lexical and contextual information to correct OCR errors accurately. Based on these premises, the proposed system is composed of the following three modules. First, it generates candidate words and evaluates the confidence of each candidate. Second, it selects feasible candidates by evaluating the word possibility of each candidate from its syllable connectivity and morphotactic connectivity. Third, it analyzes the contextual association among words and selects the most feasible word using statistical information that synthesizes the confidence weight, word possibility, and contextual association strength of each candidate. Experimental results show that the proposed system improves the accuracy of an OCR system from 94.1% to 97.6% at the character level and from 87.6% to 97.1% at the word level. In addition, the system recognizes 87.4% of unknown words.
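The abstract does not give the formula used to synthesize the three scores, so the following is only a minimal sketch of how the second and third modules could fit together. It assumes each score is a probability-like value in (0, 1] and combines them as a weighted sum of logarithms. The names `Candidate`, `synthesized_score`, `select_word`, the weights, and the threshold are all hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    confidence: float        # OCR confidence weight (assumed in (0, 1])
    word_possibility: float  # syllable + morphotactic connectivity (assumed in (0, 1])
    context_strength: float  # contextual association with neighbours (assumed in (0, 1])


def synthesized_score(c, weights=(1.0, 1.0, 1.0)):
    """Weighted log-linear combination; an assumed form, not the paper's."""
    floor = 1e-12  # guards against log(0) for scores that are exactly zero
    parts = (c.confidence, c.word_possibility, c.context_strength)
    return sum(w * math.log(max(p, floor)) for w, p in zip(weights, parts))


def select_word(candidates, min_word_possibility=0.05):
    """Filter by word possibility (module 2), then pick the best score (module 3)."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    feasible = [c for c in candidates if c.word_possibility >= min_word_possibility]
    # Fall back to every candidate if the filter rejects all of them.
    pool = feasible or candidates
    return max(pool, key=synthesized_score)


if __name__ == "__main__":
    toy = [
        Candidate("A", confidence=0.6, word_possibility=0.01, context_strength=0.5),
        Candidate("B", confidence=0.4, word_possibility=0.8, context_strength=0.7),
        Candidate("C", confidence=0.5, word_possibility=0.3, context_strength=0.2),
    ]
    print(select_word(toy).word)
```

In this toy input the candidate with the highest OCR confidence is discarded because its word possibility is below the threshold, which illustrates why the abstract argues that recognizer confidence alone is not enough.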


Similar resources

Multi-level post-processing for Korean character recognition using morphological analysis and linguistic evaluation

Most of the post-processing methods for character recognition rely on contextual information of character and word-fragment levels. However, due to linguistic characteristics of Korean, such low-level information alone is not sufficient for high-quality character-recognition applications, and we need much higher-level contextual information to improve the recognition results. This paper present...


An OCR Post-processing Approach Based on Multi-knowledge

This paper proposes an OCR post-processing approach based on multi-knowledge, which integrates language knowledge and candidate distance information given by the OCR engine. In this approach, statistical language model and semantic lexicon are combined, and candidate distance information is used to reduce the size of the search space. The experimental results show that this approach is very eff...


A post-processor for Gurmukhi OCR

A post-processing system for OCR of Gurmukhi script has been developed. Statistical information of Punjabi language syllable combinations, corpora look-up and certain heuristics based on Punjabi grammar rules have been combined to design the post-processor. An improvement of 3% in recognition rate, from 94.35% to 97.34%, has been reported on clean images using the post-processing techniques.


Statistical Learning for OCR Text Correction

The accuracy of Optical Character Recognition (OCR) is crucial to the success of subsequent applications used in text analyzing pipeline. Recent models of OCR post-processing significantly improve the quality of OCR-generated text, but are still prone to suggest correction candidates from limited observations while insufficiently accounting for the characteristics of OCR errors. In this paper, ...


Efficient OCR Post-Processing Combining Language, Hypothesis and Error Models

In this paper, an OCR post-processing method that combines a language model, OCR hypothesis information and an error model is proposed. The approach can be seen as a flexible and efficient way to perform Stochastic Error-Correcting Language Modeling. We use Weighted Finite-State Transducers (WFSTs) to represent the language model, the complete set of OCR hypotheses interpreted as a sequence of ...




Publication date: 2007
